A graphical review of log-price by make indicates most brands occupy some narrow portion of the price spectrum. A few brands stand out with relatively large pricing ranges (BMW, Mazda, Nissan).
No brands appear to be obvious pricing outliers, however a look at the number of samples per brand shows several brands with small portfolios. These brands will be removed from the subsequent analysis: Renault, Isuzu, Mercury, Chevrolet, Alfa Romeo, Porsche, and Jaguar
The ANOVA analysis of car makes returns a p-value < 0.05 and shows no abnormalities in normality, indicating significant differences in the mean prices of some brands.
Moving on to a more detailed pairwise Tukey analysis, we see statistically significant differences between the means of multiple brands. The largest differences occur between luxury brands like Mercedes, BMW and Volvo contrasted against more economic brands like Dodge, Honda, Mazda, and Plymouth.
Note: Due to the large number of combinations of makes (105), only significant Tukey results are presented below.
[1] "ANOVA Results:"
Df Sum Sq Mean Sq F value Pr(>F)
make 14 26.94 1.9240 23.82 <2e-16 ***
Residuals 168 13.57 0.0808
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "Tukey Results:"
diff lwr upr p.adj comparison
------- ------- ------- ------ -------------------------
-0.834 -1.350 -0.319 0.000 dodge-audi
-0.793 -1.276 -0.310 0.000 honda-audi
-0.566 -1.031 -0.102 0.004 mazda-audi
0.631 0.103 1.159 0.005 mercedes-benz-audi
-0.691 -1.174 -0.208 0.000 mitsubishi-audi
-0.602 -1.063 -0.141 0.001 nissan-audi
-0.827 -1.371 -0.282 0.000 plymouth-audi
-0.747 -1.237 -0.258 0.000 subaru-audi
-0.622 -1.057 -0.187 0.000 toyota-audi
-0.578 -1.067 -0.089 0.006 volkswagen-audi
-1.176 -1.652 -0.701 0.000 dodge-bmw
-1.134 -1.574 -0.695 0.000 honda-bmw
-0.908 -1.328 -0.489 0.000 mazda-bmw
-1.033 -1.473 -0.593 0.000 mitsubishi-bmw
-0.944 -1.360 -0.528 0.000 nissan-bmw
-0.480 -0.934 -0.025 0.028 peugot-bmw
-1.169 -1.675 -0.662 0.000 plymouth-bmw
-1.089 -1.536 -0.643 0.000 subaru-bmw
-0.964 -1.351 -0.577 0.000 toyota-bmw
-0.920 -1.366 -0.473 0.000 volkswagen-bmw
1.465 0.990 1.941 0.000 mercedes-benz-dodge
0.697 0.257 1.136 0.000 peugot-dodge
0.674 0.159 1.190 0.001 saab-dodge
0.844 0.405 1.284 0.000 volvo-dodge
1.424 0.984 1.863 0.000 mercedes-benz-honda
0.655 0.254 1.056 0.000 peugot-honda
0.633 0.150 1.116 0.001 saab-honda
0.803 0.402 1.204 0.000 volvo-honda
1.197 0.778 1.617 0.000 mercedes-benz-mazda
0.429 0.050 0.807 0.011 peugot-mazda
0.577 0.198 0.955 0.000 volvo-mazda
-1.322 -1.762 -0.883 0.000 mitsubishi-mercedes-benz
-1.233 -1.649 -0.817 0.000 nissan-mercedes-benz
-0.769 -1.223 -0.314 0.000 peugot-mercedes-benz
-1.458 -1.964 -0.951 0.000 plymouth-mercedes-benz
-0.791 -1.319 -0.263 0.000 saab-mercedes-benz
-1.378 -1.825 -0.932 0.000 subaru-mercedes-benz
-1.253 -1.640 -0.866 0.000 toyota-mercedes-benz
-1.209 -1.656 -0.762 0.000 volkswagen-mercedes-benz
-0.621 -1.076 -0.166 0.001 volvo-mercedes-benz
0.554 0.153 0.954 0.000 peugot-mitsubishi
0.531 0.048 1.014 0.017 saab-mitsubishi
0.701 0.301 1.102 0.000 volvo-mitsubishi
0.464 0.090 0.839 0.003 peugot-nissan
0.612 0.238 0.987 0.000 volvo-nissan
-0.689 -1.162 -0.216 0.000 plymouth-peugot
-0.610 -1.018 -0.201 0.000 subaru-peugot
-0.484 -0.826 -0.142 0.000 toyota-peugot
-0.440 -0.849 -0.032 0.022 volkswagen-peugot
0.667 0.122 1.211 0.004 saab-plymouth
0.837 0.364 1.310 0.000 volvo-plymouth
-0.587 -1.077 -0.098 0.005 subaru-saab
-0.462 -0.897 -0.027 0.026 toyota-saab
0.758 0.349 1.166 0.000 volvo-subaru
0.632 0.290 0.974 0.000 volvo-toyota
0.588 0.180 0.997 0.000 volvo-volkswagen
In this section, we generate bootstrapped sample means for each brand and analyze their differences. Individual bootstrapped means look similar to the initial box plots.
Comparison of bootstrapped means returns similar results, with comparisons between expensive brands like Mercedes, BMW and Volvo against Dodge, Mazda, Plymouth and Honda dominating the tails.
Note: Histograms were collapsed into point-range plots for this analysis due to the large number of makes.
A cursory glance at the differences between body styles reveals generous overlap between the inter-quartile ranges of the various body styles. Looking at the number of samples per body style, we see that convertible and hardtop are relatively rare, and will not be included in subsequent analysis.
The ANOVA returns a p-value < 0.05, indicating some body styles command different mean prices. It must be noted that the Q-Q plot shows some non-normality in the lower price quartiles.
Deeper analysis of the Tukey ANOVA reveals a significant difference between the prices of hatchbacks and sedans.
[1] "ANOVA Results:"
Df Sum Sq Mean Sq F value Pr(>F)
body.style 2 3.85 1.9264 9.274 0.000145 ***
Residuals 184 38.22 0.2077
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "Tukey Results:"
diff lwr upr p.adj comparison
------- ------- ------ ------ ----------------
0.311 0.139 0.482 0.000 sedan-hatchback
0.223 -0.029 0.475 0.094 wagon-hatchback
-0.088 -0.330 0.154 0.668 wagon-sedan
Comparing the bootstrapped means shows a similar conclusion: the 95% CI on the difference between hatchbacks and sedans is between approximately -0.47 and -0.27. Interestingly, 95% CI on the difference between wagons and hatchbacks also indicates a likely difference in price- this is not consistent with the ANOVA.
Graphical analysis of drive wheels seems to indicate RWD cars command a price premium. Though there are relatively few 4WD samples, the sample was still usable.
The ANOVA analysis of drive wheels indicates that some drive configurations are indeed significantly different. No normality issues are apparent.
The Tukey plots reveal significant differences between RWD-4WD and RWD-FWD. 4WD and FWD are not significantly different.
[1] "ANOVA Results:"
Df Sum Sq Mean Sq F value Pr(>F)
drive.wheels 2 23.81 11.903 88.46 <2e-16 ***
Residuals 198 26.64 0.135
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "Tukey Results:"
diff lwr upr p.adj comparison
------- ------- ------ ------ -----------
-0.119 -0.435 0.198 0.649 fwd-4wd
0.599 0.276 0.921 0.000 rwd-4wd
0.718 0.590 0.845 0.000 rwd-fwd
The boostrapped means of drive configurations arrive at the same conclusion- RWD is signigicantly different than both 4WD and FWD, and 4WD-FWD are not significantly different.
Graphically, we can see that only four, five, six, and eight cylinder engines have enough samples to analyze. It appears that four cylinder models occupy the low end of the price range, five and six cylinders the mid range, and eight cylinders the high end.
The ANOVA indeed supports a significant difference between cylinder counts, though some non-normality appears to occur in the upper prices.
The Tukey ANOVA shows that only the five and six cylinder engines are priced similarly. Differences between the four or eight cylinder engines to other configurations are all significant.
[1] "ANOVA Results:"
Df Sum Sq Mean Sq F value Pr(>F)
num.of.cylinders 3 24.81 8.269 67.01 <2e-16 ***
Residuals 191 23.57 0.123
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "Tukey Results:"
diff lwr upr p.adj comparison
------- ------- ------- ------ -----------
-0.601 -1.140 -0.062 0.022 five-eight
-1.387 -1.848 -0.926 0.000 four-eight
-0.554 -1.046 -0.063 0.020 six-eight
-0.786 -1.083 -0.489 0.000 four-five
0.047 -0.296 0.389 0.985 six-five
0.833 0.633 1.032 0.000 six-four
The bootstrap analysis arrives at the same result, showing the four cylinder at the lower prices, five and six cylinders in the middle, and eight at the top of the price range.
Graphical analysis shows a difference in price by engine type, though with overlap between many types. The rotor and dohcv engines are rare and will not be included in further analysis.
The ANOVA indicates significant price differences between some engine types. The Tukey plot highlights these differences between ohcv-ohc, ohcv-ohcf, and ohc-dohc types.
[1] "ANOVA Results:"
Df Sum Sq Mean Sq F value Pr(>F)
engine.type 4 10.04 2.5103 11.96 1.06e-08 ***
Residuals 192 40.28 0.2098
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "Tukey Results:"
diff lwr upr p.adj comparison
------- ------- ------- ------ -----------
-0.181 -0.696 0.333 0.868 l-dohc
-0.487 -0.866 -0.108 0.005 ohc-dohc
-0.416 -0.905 0.072 0.135 ohcf-dohc
0.309 -0.196 0.814 0.447 ohcv-dohc
-0.305 -0.684 0.074 0.177 ohc-l
-0.235 -0.723 0.254 0.677 ohcf-l
0.490 -0.015 0.995 0.062 ohcv-l
0.070 -0.272 0.413 0.980 ohcf-ohc
0.795 0.430 1.161 0.000 ohcv-ohc
0.725 0.247 1.203 0.000 ohcv-ohcf
The bootstrap also picks up on differences between ohcv-ohc, ohcv-ohcf, and ohc-dohc types, but also picks up on differences between dohc-ohcf and, interestingly, sees a separation of the mid-range l type against the lower-priced ohc and the high-priced ohcv types.
#### Load libraries ####
library(dplyr)
library(reshape2)
library(ggplot2)
library(lazyeval)
library(HistData)
library(resample)
library(simpleboot)
library(knitr)
library(RColorBrewer)
######################################################################
## Import and set typing + basic calculated columns
######################################################################
# Import
auto.data <- read.csv('Automobile price data _Raw_.csv', stringsAsFactors = FALSE)
# Set types on factor columns
factor.cols <- c("make", "fuel.type", "aspiration", "num.of.doors",
"body.style", "drive.wheels", "engine.location",
"engine.type", "num.of.cylinders", "fuel.system")
auto.data[, factor.cols] <- lapply(auto.data[factor.cols], as.factor)
# Set types on numeric columns
numeric.cols <- c("wheel.base", "length", "width", "height",
"curb.weight", "engine.size", "bore", "stroke",
"compression.ratio", "horsepower", "peak.rpm",
"city.mpg", "highway.mpg", "price")
auto.data[, numeric.cols] <- lapply(auto.data[numeric.cols], as.numeric)
# Calculate log-price column
auto.data <- auto.data %>% mutate(log.price = log(price))
######################################################################
## Define Functions
######################################################################
#### Graphical analysis function ####
f_graph_analysis <- function(data, dependent, independent) {
## Description: This function takes a dataset with dependent and independent
## variables, and generates a boxplot with independent variables
## ordered by the median of the dependent. Also plots a bar chart
## of counts of each independent variable level.
## Arguments:
## data: A dataframe for plotting
## dependent: character object describing dependent variable column name
## independent: character object describing independent variable column name
# Select columns of interest and remove rows with blanks
plot.data <- data[, c(dependent, independent)]
plot.data <- plot.data[complete.cases(plot.data), ]
# Provide generic column names to standardize analysis pipelines
colnames(plot.data) <- c("dependent", "independent")
# Re-order independent variable factor levels by median price (dependent variable)
median.plot.data <- aggregate(dependent ~ independent, plot.data, median)
plot.data$independent <- factor(plot.data$independent,
levels = median.plot.data[order(median.plot.data$dependent), "independent"])
# Build a boxplot of independent-dependent values
p.boxplot <- ggplot(plot.data, aes(independent, dependent)) +
geom_boxplot() +
labs(title = paste("Box plot of ", independent, " vs. ", dependent, sep =),
x = independent, y = dependent)
print(p.boxplot)
# Plot number of observations of each level of independent variable
p.counts <- ggplot(plot.data, aes(independent)) +
geom_bar() +
labs(title = paste("Counts of ", independent, sep =""),
x = independent, y = "count")
print(p.counts)
}
#### Filtering function ####
f_remove_levels <- function(data, target.column, remove.levels = NULL, na.columns = "log.price") {
## Description: This function removes any levels of an independent variable that
## we want to eliminate before ANOVA or bootstrap analysis. Also removes
## any rows where the dependent variable has a value of NA.
##
## Arguments:
## data: input data frame
## target.column: column to massage
## remove.levels: any levels of the target column we want filtered out (e.g. if a level has too few samples to model)
## na.columns: columns to remove NA values from.
# If remove.levels has been defined, build an index of rows in data that contain those
# values and remove them
if (!is.null(remove.levels)) {
remove.index <- which(data[, target.column] %in% remove.levels)
data <- data[-remove.index, ]
}
# Remove rows containing an NA in na.columns
data <- data[!is.na(data[, na.columns]), ]
# Drop unused factor levels made extinct in the above processing
data <- droplevels(data)
return(data)
}
## ANOVA function
f_anova_analyses <- function(data, formula, truncate = FALSE, suppress.diag.plot = TRUE) {
## Description: Takes dataset and formula and calculates ANOVA and Tukey HSD.
## Returns ANOVA and Tukey tables and plots.
## Arguments:
## data: Dataframe for analysis
## formula: A formula-type object describing the model. ex: formula(y ~ x)
## truncate: Filter printed results to only show significant (p<0.05) results. Use when too many levels are presented to be meaningful.
## suppress.diag.plot: Suppresses output of diagnostic plots from ANOVA. Suppress in reporting when no diagnostic issues are known.
# Calculate basic ANOVA
anova.df <- aov(formula, data = data)
# Print ANOVA table
print("ANOVA Results:")
print(summary(anova.df))
# Set a 2x2 plot layout and print diagnostic plots
# Only executes if suppress.diag.plot == FALSE
if (suppress.diag.plot == FALSE) {
layout(matrix(c(1,2,3,4),2,2))
plot(anova.df)
layout(matrix(c(1),1,1))
}
# Calculate Tukey ANOVA from ANOVA results
tukey.df <- TukeyHSD(anova.df)
# Build data frame from Tukey results for manipulation
tukey.df <- as.data.frame(tukey.df[1])
colnames(tukey.df) <- c("diff", "lwr", "upr", "p.adj") # Re-assert names lost in df translation
tukey.df$comparison <- factor(rownames(tukey.df), levels = row.names(tukey.df)[order(tukey.df$diff)]) # Copy rownames into a column
row.names(tukey.df) <- NULL # Remove row names
tukey.df$significant <- ifelse(tukey.df$p.adj <= 0.05, "yes", "no") # Flag statistically significant rows
# When truncate == TRUE, filter out Tukey results that are not significant
if (truncate == TRUE) {
tukey.df <- tukey.df %>% filter(p.adj <= 0.05)
}
# Print Tukey table. Use kable for nicer looking prints
print("Tukey Results:")
print(kable(tukey.df[, 1:5], digits = 3))
## Plot results
# Create a custom color scale, mapping colors to statistical significance (tukey.df$significant)
myColors <- brewer.pal(5,"Set1")
names(myColors) <- levels(tukey.df$significant)
colScale <- scale_colour_manual(name = "significant",values = myColors)
# Build ggplot of tukey.df. Note the coord_flip to place names on Y axis, and h-line at zero.
p.tukey <- ggplot(tukey.df, aes(x = comparison, y = diff, ymin = lwr, ymax = upr)) +
geom_pointrange(aes(colour = significant)) +
colScale +
coord_flip() +
geom_hline(yintercept = 0) +
labs(title = "Tukey HSD Results", y = "Difference of log-price means")
print(p.tukey)
}
## Bootstrapping functions
f_multilevel_one.boot <- function(data, independent, dependent, type = "mean", R.count = 10000) {
## Description: This function generates bootstrap samples of each level
## of an independent variable (e.g. "ford" and "gm" levels of variable "make").
## Arguments:
## data: dataframe
## independent: independent variable column name
## dependent: dependent variable column name
## type: mean or median bootsrap
## R.count: number of bootstrap samples
# Break out target levels
boot.levels <- levels(data[, independent])
# Bootstrap means/medians of each level of independent variable
boots <- data.frame() # Pre-allocate results dataframe
for (i in 1:length(boot.levels)) {
current.level.data <- data[which(data[, independent] == boot.levels[i]), ] # Data for current level of independent variable
boot.level <- one.boot(current.level.data[, dependent], substitute(type), R = R.count) # Bootstrap R.count iterations of mean/median of current data
boots <- rbind(boots, t(boot.level$t)) # Append results to boots dataframe
}
# Re-orient results and name columns with independent variable levels
boots <- as.data.frame(t(boots))
colnames(boots) <- boot.levels
return(boots)
}
f_multilevel_two.boot <- function(data, independent, dependent, type = "mean", R.count = 10000) {
## Description: This function generates bootstrap samples of DIFFERENCES between each level
## of an independent variable (e.g. "ford" and "gm" levels of variable "make").
## Arguments:
## data: dataframe
## independent: independent variable column name
## dependent: dependent variable column name
## type: mean or median bootsrap
## R.count: number of bootstrap samples
# List out all levels of independent variable
boot.levels <- levels(data[, independent])
# Generate all combinations of levels
# Example:
# 1 2 3
# 1 1,1 1,2 1,3
# 2 2,1 2,2 2,3
# 3 3,1 3,2 3,3
boot.combinations <- expand.grid(boot.levels, boot.levels)
# Remove comparisons of a level to itself
# Example: Remove 1,1 2,2 and 3,3
boot.combinations <- boot.combinations[which(boot.combinations$Var1 != boot.combinations$Var2),]
# Generate numeric columns from factor levels for easy manipulation
boot.combinations$Num1 <- as.numeric(boot.combinations$Var1)
boot.combinations$Num2 <- as.numeric(boot.combinations$Var2)
# Use numeric values to remove duplicate, but reversed comparisons
# Example: 1,2 and 2,1 are the same. Note that in the upper half of the diagonal matrix, the
# row number is always lower than the column number. This makes a simple filtering rule:
# 1 2 3
# 1 1,1 [1,2] [1,3]
# 2 2,1 2,2 [2,3]
# 3 3,1 3,2 3,3
boot.combinations <- boot.combinations[which(boot.combinations$Num1 < boot.combinations$Num2), ]
# Remove numeric columns now that we're done filtering.
boot.combinations <- boot.combinations %>% select(Var1, Var2)
# Generate bootstraps for the pairs.
boots <- data.frame()
for (i in 1:nrow(boot.combinations)) {
sample.a <- data[which(data[, independent] == boot.combinations$Var1[i]), ]
sample.b <- data[which(data[, independent] == boot.combinations$Var2[i]), ]
two.boot.temp <- two.boot(sample.a[, dependent], sample.b[, dependent], substitute(type), R = R.count)
boots <- rbind(boots, t(two.boot.temp$t))
}
# Re-orient results and add column names
boots <- as.data.frame(t(boots))
colnames(boots) <- paste(boot.combinations$Var1, boot.combinations$Var2, sep = ".")
return(boots)
}
## Bootstrap Plotting functions
plot.hist <- function(a, maxs, mins, cols = 'difference of means', nbins = 80, p = 0.05) {
## Description: Histogram plotting function copied from DS350 course code.
breaks = seq(maxs, mins, length.out = (nbins + 1))
hist(a, breaks = breaks, main = paste('Histogram of', cols), xlab = cols)
abline(v = mean(a), lwd = 4, col = 'red')
abline(v = 0, lwd = 4, col = 'blue')
abline(v = quantile(a, probs = p/2), lty = 3, col = 'red', lwd = 3)
abline(v = quantile(a, probs = (1 - p/2)), lty = 3, col = 'red', lwd = 3)
}
## Plot multiple histograms for bootstrap results on multiple levels
f_multi_hist <- function(data, simplify = FALSE, nbins = 80, p = 0.05,
plot.title = "Histogram", y.label = "mean log-price", x.label = "Level"){
## Description: Builds multiple histograms of bootstrap results for variables with multiple levels.
## There is a "simplify" option to collapse histograms into simpler point-ranges for
## cases where too many levels (more than ~5) generate an overwhelming number of histograms.
##
## Arguments:
## data: Bootstrap results
## simplify: Generates point-ranges instead of histograms when TRUE
## nbins: Number of histogram bins
## p: Confidence interval threshold
## plot.title: Plot title
## y.label: X label
## x.label: Y label
# Get max and min values for plot limits
maxs = max(data)
mins = min(data)
# Show histograms when simplify == FALSE
if (simplify == FALSE) {
# Use par to merge multiple plots into a single plot window. Use ncol(data) to make as many plots as there are levels.
par(mfrow = c(ncol(data), 1))
# Generate each individual plot and add to the plot layout initiated in the previous line.
for (i in 1:ncol(data)) {
plot.hist(data[, i], maxs, mins, cols = colnames(data)[i])
}
title(plot.title, outer = TRUE, cex.main = 2)
# Reset plot window to 1x1 for future plots
par(mfrow = c(1, 1))
}
# Condense into pointrange plots when simplify == TRUE
if (simplify == TRUE) {
# Calculate mean, upr, lwr values
mean <- apply(data, 2, mean)
upr <- apply(data, 2, function(x) quantile(x, probs = 1 - p/2))
lwr <- apply(data, 2, function(x) quantile(x, probs = p/2))
# Combine basic stats into a dataframe
plot.data <- data.frame("name" = colnames(data), mean, upr, lwr)
# Make name column a factor type, and order factor levels by their means for ordered plotting
plot.data$name <- factor(plot.data$name, levels = plot.data$name[order(plot.data$mean)])
# Generate pointrange plot of bootstrap results
p.multi <- ggplot(plot.data, aes(x = name, y = mean, ymin = lwr, ymax = upr)) +
geom_pointrange() +
coord_flip() +
geom_hline(yintercept = 0) +
lims(y = c(min(plot.data$lwr), max(plot.data$upr))) +
labs(title = plot.title, y = y.label, x = x.label)
print(p.multi)
}
}
######################################################################
## Analyses
######################################################################
### Make
# Graphical Analysis
f_graph_analysis(data = auto.data, dependent = "log.price", "make")
# Remove unusual levels
make.data <- f_remove_levels(auto.data, "make", remove.levels = c("renault", "isuzu", "mercury", "chevrolet", "alfa-romero", "porsche", "jaguar"))
# ANOVA
f_anova_analyses(make.data, formula(log.price ~ make), suppress.diag.plot = FALSE)
# Bootstrap means
make.one.boot <- f_multilevel_one.boot(make.data, "make", "log.price")
f_multi_hist(make.one.boot, simplify = TRUE, plot.title = "Bootstrap means of log-price by make")
# Bootstrap mean differences
make.two.boot <- f_multilevel_two.boot(make.data, "make", "log.price")
f_multi_hist(make.two.boot, simplify = TRUE, plot.title = "Difference of bootstrap mean log-price by make", y.label = "Mean log-price difference")
#### Body Style
# Graphical Comparison
f_graph_analysis(data = auto.data, dependent = "log.price", "body.style")
# Remove unusual levels
body.data <- f_remove_levels(auto.data, "body.style", remove.levels = c("convertible", "hardtop"))
# ANOVA
f_anova_analyses(body.data, formula(log.price ~ body.style), suppress.diag.plot = FALSE)
# Bootstrap means
body.one.boot <- f_multilevel_one.boot(body.data, "body.style", "log.price")
f_multi_hist(body.one.boot, plot.title = "Bootstrap means of log-price by body style")
# Bootstrap mean differences
body.two.boot <- f_multilevel_two.boot(body.data, "body.style", "log.price")
f_multi_hist(body.two.boot, plot.title = "Difference of bootstrap mean log-price by body style", y.label = "Mean log-price difference")
#### Drive Wheels
# Graphical
f_graph_analysis(data = auto.data, dependent = "log.price", "drive.wheels")
# Remove unusual levels
drive.data <- f_remove_levels(auto.data, "drive.wheels")
# ANOVA
f_anova_analyses(drive.data, formula(log.price ~ drive.wheels), suppress.diag.plot = FALSE)
# Bootstrap means
drive.one.boot <- f_multilevel_one.boot(drive.data, "drive.wheels", "log.price")
f_multi_hist(drive.one.boot, plot.title = "Bootstrap means of log-price by drive wheels")
# Bootstrap mean differences
drive.two.boot <- f_multilevel_two.boot(drive.data, "drive.wheels", "log.price")
f_multi_hist(drive.two.boot, plot.title = "Difference of bootstrap mean log-price by drive wheels", y.label = "Mean log-price difference")
#### Cylinder Count
# Compare variable levels graphically
f_graph_analysis(data = auto.data, dependent = "log.price", "num.of.cylinders")
# Remove unusual levels
cylinder.data <- f_remove_levels(auto.data, "num.of.cylinders", remove.levels = c("two", "three", "twelve"))
# ANOVA
f_anova_analyses(cylinder.data, formula(log.price ~ num.of.cylinders), suppress.diag.plot = FALSE)
# Bootstrap means
cylinder.one.boot <- f_multilevel_one.boot(cylinder.data, "num.of.cylinders", "log.price")
f_multi_hist(cylinder.one.boot, simplify = TRUE, plot.title = "Bootstrap means of log-price by cylinder count")
# Bootstrap mean differences
cylinder.two.boot <- f_multilevel_two.boot(cylinder.data, "num.of.cylinders", "log.price")
f_multi_hist(cylinder.two.boot, simplify = TRUE, plot.title = "Difference of bootstrap mean log-price by cylinder count", y.label = "Mean log-price difference")
#### Engine Type
# Compare variable levels graphically
f_graph_analysis(data = auto.data, dependent = "log.price", "engine.type")
# Remove unusual levels
engine.type.data <- f_remove_levels(auto.data, "engine.type", remove.levels = c("dohcv", "rotor"))
# ANOVA
f_anova_analyses(engine.type.data, formula(log.price ~ engine.type), suppress.diag.plot = FALSE)
# Bootstrap means
engine.type.one.boot <- f_multilevel_one.boot(engine.type.data, "engine.type", "log.price")
f_multi_hist(engine.type.one.boot, simplify = TRUE, plot.title = "Bootstrap means of log-price by engine type")
# Bootstrap mean differences
engine.type.two.boot <- f_multilevel_two.boot(engine.type.data, "engine.type", "log.price")
f_multi_hist(engine.type.two.boot, simplify = TRUE, plot.title = "Difference of bootstrap mean log-price by engine type", y.label = "Mean log-price difference")